In [290]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
import pandas_profiling

#to get all columns in dataframe on page
pd.set_option('display.max_columns', None)

import PyPDF2

from scipy.stats import kurtosis
from scipy.stats import skew

from sklearn import metrics
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler
In [291]:
#listing files in the working directory with their sizes
for f in os.listdir():
    print(f.ljust(30) +"--" + str(round(os.path.getsize(f) / 1000000, 2)) + 'MB')
.ipynb_checkpoints            --0.0MB
Data - Parkinsons             --0.04MB
Data - parkinsons.names       --0.0MB
LETTER G - Z.pdf              --0.48MB
Problem Statement - Ensemble.pdf--0.5MB
Project_Parkinson.ipynb       --4.89MB
In [292]:
filename = 'Data - parkinsons.names'
file = open(filename,mode='r')
text = file.read()
file.close()
print(text)
Title: Parkinsons Disease Data Set

Abstract: Oxford Parkinson's Disease Detection Dataset

-----------------------------------------------------	

Data Set Characteristics: Multivariate
Number of Instances: 197
Area: Life
Attribute Characteristics: Real
Number of Attributes: 23
Date Donated: 2008-06-26
Associated Tasks: Classification
Missing Values? N/A

-----------------------------------------------------	

Source:

The dataset was created by Max Little of the University of Oxford, in 
collaboration with the National Centre for Voice and Speech, Denver, 
Colorado, who recorded the speech signals. The original study published the 
feature extraction methods for general voice disorders.

-----------------------------------------------------

Data Set Information:

This dataset is composed of a range of biomedical voice measurements from 
31 people, 23 with Parkinson's disease (PD). Each column in the table is a 
particular voice measure, and each row corresponds one of 195 voice 
recording from these individuals ("name" column). The main aim of the data 
is to discriminate healthy people from those with PD, according to "status" 
column which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. The rows of the CSV file contain an 
instance corresponding to one voice recording. There are around six 
recordings per patient, the name of the patient is identified in the first 
column.For further information or to pass on comments, please contact Max 
Little (littlem '@' robots.ox.ac.uk).

Further details are contained in the following reference -- if you use this 
dataset, please cite:
Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 
'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', 
IEEE Transactions on Biomedical Engineering (to appear).

-----------------------------------------------------

Attribute Information:

Matrix column entries (attributes):
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several 
measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
status - Health status of the subject (one) - Parkinson's, (zero) - healthy
RPDE,D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation 

-----------------------------------------------------

Citation Request:

If you use this dataset, please cite the following paper: 
'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', 
Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. 
BioMedical Engineering OnLine 2007, 6:23 (26 June 2007)



  • This document contains the details of the data set
In [293]:
filename = 'LETTER G - Z.pdf'
# creating an object 
file = open(filename, 'rb')

# creating a pdf reader object
fileReader = PyPDF2.PdfFileReader(file)

# print the number of pages in pdf file
print(fileReader.numPages)
10
In [294]:
pageObj = fileReader.getPage(0)
pageObj.extractText()
Out[294]:
'Colour Letters G,H in red colour'
In [295]:
for i in range(fileReader.numPages):
    pageObj = fileReader.getPage(i)
    print(pageObj.extractText())
Colour Letters G,H in red colour
Colour Letters I,J in red colour
Colour Letters K,L in red colour
Colour Letters M,N in red colour
Colour Letters O,P in blue colour
Colour Letters Q,R in blue colour
Colour Letters S,T in blue colour



In [296]:
from IPython.display import IFrame
IFrame(filename, width=600, height=900)
Out[296]:
  • This pdf contains the letter details
In [297]:
data_p=pd.read_csv('Data - Parkinsons')
print("Shape:",data_p.shape)
print("Contains 195 rows and 24 columns")
data_p.head(10)
Shape: (195, 24)
Contains 195 rows and 24 columns
Out[297]:
name MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer MDVP:Shimmer(dB) Shimmer:APQ3 Shimmer:APQ5 MDVP:APQ Shimmer:DDA NHR HNR status RPDE DFA spread1 spread2 D2 PPE
0 phon_R01_S01_1 119.992 157.302 74.997 0.00784 0.00007 0.00370 0.00554 0.01109 0.04374 0.426 0.02182 0.03130 0.02971 0.06545 0.02211 21.033 1 0.414783 0.815285 -4.813031 0.266482 2.301442 0.284654
1 phon_R01_S01_2 122.400 148.650 113.819 0.00968 0.00008 0.00465 0.00696 0.01394 0.06134 0.626 0.03134 0.04518 0.04368 0.09403 0.01929 19.085 1 0.458359 0.819521 -4.075192 0.335590 2.486855 0.368674
2 phon_R01_S01_3 116.682 131.111 111.555 0.01050 0.00009 0.00544 0.00781 0.01633 0.05233 0.482 0.02757 0.03858 0.03590 0.08270 0.01309 20.651 1 0.429895 0.825288 -4.443179 0.311173 2.342259 0.332634
3 phon_R01_S01_4 116.676 137.871 111.366 0.00997 0.00009 0.00502 0.00698 0.01505 0.05492 0.517 0.02924 0.04005 0.03772 0.08771 0.01353 20.644 1 0.434969 0.819235 -4.117501 0.334147 2.405554 0.368975
4 phon_R01_S01_5 116.014 141.781 110.655 0.01284 0.00011 0.00655 0.00908 0.01966 0.06425 0.584 0.03490 0.04825 0.04465 0.10470 0.01767 19.649 1 0.417356 0.823484 -3.747787 0.234513 2.332180 0.410335
5 phon_R01_S01_6 120.552 131.162 113.787 0.00968 0.00008 0.00463 0.00750 0.01388 0.04701 0.456 0.02328 0.03526 0.03243 0.06985 0.01222 21.378 1 0.415564 0.825069 -4.242867 0.299111 2.187560 0.357775
6 phon_R01_S02_1 120.267 137.244 114.820 0.00333 0.00003 0.00155 0.00202 0.00466 0.01608 0.140 0.00779 0.00937 0.01351 0.02337 0.00607 24.886 1 0.596040 0.764112 -5.634322 0.257682 1.854785 0.211756
7 phon_R01_S02_2 107.332 113.840 104.315 0.00290 0.00003 0.00144 0.00182 0.00431 0.01567 0.134 0.00829 0.00946 0.01256 0.02487 0.00344 26.892 1 0.637420 0.763262 -6.167603 0.183721 2.064693 0.163755
8 phon_R01_S02_3 95.730 132.068 91.754 0.00551 0.00006 0.00293 0.00332 0.00880 0.02093 0.191 0.01073 0.01277 0.01717 0.03218 0.01070 21.812 1 0.615551 0.773587 -5.498678 0.327769 2.322511 0.231571
9 phon_R01_S02_4 95.056 120.103 91.226 0.00532 0.00006 0.00268 0.00332 0.00803 0.02838 0.255 0.01441 0.01725 0.02444 0.04324 0.01022 21.862 1 0.547037 0.798463 -5.011879 0.325996 2.432792 0.271362
In [298]:
data_p.tail(10)
Out[298]:
name MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer MDVP:Shimmer(dB) Shimmer:APQ3 Shimmer:APQ5 MDVP:APQ Shimmer:DDA NHR HNR status RPDE DFA spread1 spread2 D2 PPE
185 phon_R01_S49_3 116.286 177.291 96.983 0.00314 0.00003 0.00134 0.00192 0.00403 0.01564 0.136 0.00667 0.00990 0.01691 0.02001 0.00737 24.199 0 0.598515 0.654331 -5.592584 0.133917 2.058658 0.214346
186 phon_R01_S49_4 116.556 592.030 86.228 0.00496 0.00004 0.00254 0.00263 0.00762 0.01660 0.154 0.00820 0.00972 0.01491 0.02460 0.01397 23.958 0 0.566424 0.667654 -6.431119 0.153310 2.161936 0.120605
187 phon_R01_S49_5 116.342 581.289 94.246 0.00267 0.00002 0.00115 0.00148 0.00345 0.01300 0.117 0.00631 0.00789 0.01144 0.01892 0.00680 25.023 0 0.528485 0.663884 -6.359018 0.116636 2.152083 0.138868
188 phon_R01_S49_6 114.563 119.167 86.647 0.00327 0.00003 0.00146 0.00184 0.00439 0.01185 0.106 0.00557 0.00721 0.01095 0.01672 0.00703 24.775 0 0.555303 0.659132 -6.710219 0.149694 1.913990 0.121777
189 phon_R01_S50_1 201.774 262.707 78.228 0.00694 0.00003 0.00412 0.00396 0.01235 0.02574 0.255 0.01454 0.01582 0.01758 0.04363 0.04441 19.368 0 0.508479 0.683761 -6.934474 0.159890 2.316346 0.112838
190 phon_R01_S50_2 174.188 230.978 94.261 0.00459 0.00003 0.00263 0.00259 0.00790 0.04087 0.405 0.02336 0.02498 0.02745 0.07008 0.02764 19.517 0 0.448439 0.657899 -6.538586 0.121952 2.657476 0.133050
191 phon_R01_S50_3 209.516 253.017 89.488 0.00564 0.00003 0.00331 0.00292 0.00994 0.02751 0.263 0.01604 0.01657 0.01879 0.04812 0.01810 19.147 0 0.431674 0.683244 -6.195325 0.129303 2.784312 0.168895
192 phon_R01_S50_4 174.688 240.005 74.287 0.01360 0.00008 0.00624 0.00564 0.01873 0.02308 0.256 0.01268 0.01365 0.01667 0.03804 0.10715 17.883 0 0.407567 0.655683 -6.787197 0.158453 2.679772 0.131728
193 phon_R01_S50_5 198.764 396.961 74.904 0.00740 0.00004 0.00370 0.00390 0.01109 0.02296 0.241 0.01265 0.01321 0.01588 0.03794 0.07223 19.020 0 0.451221 0.643956 -6.744577 0.207454 2.138608 0.123306
194 phon_R01_S50_6 214.289 260.277 77.973 0.00567 0.00003 0.00295 0.00317 0.00885 0.01884 0.190 0.01026 0.01161 0.01373 0.03078 0.04398 21.209 0 0.462803 0.664357 -5.724056 0.190667 2.555477 0.148569
In [299]:
df=data_p
In [300]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  status            195 non-null    int64  
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64
dtypes: float64(22), int64(1), object(1)
memory usage: 36.7+ KB
  • Total 24 columns are present
  • The majority of columns are numeric (float64)
  • The status column is int and contains 0 & 1, representing healthy and Parkinson's

Attribute Information:

  • Matrix column entries (attributes):
  • name - ASCII subject name and recording number
  • MDVP:Fo(Hz) - Average vocal fundamental frequency
  • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
  • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
  • MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
  • MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
  • NHR,HNR - Two measures of ratio of noise to tonal components in the voice
  • RPDE,D2 - Two nonlinear dynamical complexity measures
  • DFA - Signal fractal scaling exponent
  • spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Target Column

  • status - Health status of the subject (one) - Parkinson's, (zero) - healthy
In [301]:
#df.describe(include='all')
df.describe().T
Out[301]:
count mean std min 25% 50% 75% max
MDVP:Fo(Hz) 195.0 154.228641 41.390065 88.333000 117.572000 148.790000 182.769000 260.105000
MDVP:Fhi(Hz) 195.0 197.104918 91.491548 102.145000 134.862500 175.829000 224.205500 592.030000
MDVP:Flo(Hz) 195.0 116.324631 43.521413 65.476000 84.291000 104.315000 140.018500 239.170000
MDVP:Jitter(%) 195.0 0.006220 0.004848 0.001680 0.003460 0.004940 0.007365 0.033160
MDVP:Jitter(Abs) 195.0 0.000044 0.000035 0.000007 0.000020 0.000030 0.000060 0.000260
MDVP:RAP 195.0 0.003306 0.002968 0.000680 0.001660 0.002500 0.003835 0.021440
MDVP:PPQ 195.0 0.003446 0.002759 0.000920 0.001860 0.002690 0.003955 0.019580
Jitter:DDP 195.0 0.009920 0.008903 0.002040 0.004985 0.007490 0.011505 0.064330
MDVP:Shimmer 195.0 0.029709 0.018857 0.009540 0.016505 0.022970 0.037885 0.119080
MDVP:Shimmer(dB) 195.0 0.282251 0.194877 0.085000 0.148500 0.221000 0.350000 1.302000
Shimmer:APQ3 195.0 0.015664 0.010153 0.004550 0.008245 0.012790 0.020265 0.056470
Shimmer:APQ5 195.0 0.017878 0.012024 0.005700 0.009580 0.013470 0.022380 0.079400
MDVP:APQ 195.0 0.024081 0.016947 0.007190 0.013080 0.018260 0.029400 0.137780
Shimmer:DDA 195.0 0.046993 0.030459 0.013640 0.024735 0.038360 0.060795 0.169420
NHR 195.0 0.024847 0.040418 0.000650 0.005925 0.011660 0.025640 0.314820
HNR 195.0 21.885974 4.425764 8.441000 19.198000 22.085000 25.075500 33.047000
status 195.0 0.753846 0.431878 0.000000 1.000000 1.000000 1.000000 1.000000
RPDE 195.0 0.498536 0.103942 0.256570 0.421306 0.495954 0.587562 0.685151
DFA 195.0 0.718099 0.055336 0.574282 0.674758 0.722254 0.761881 0.825288
spread1 195.0 -5.684397 1.090208 -7.964984 -6.450096 -5.720868 -5.046192 -2.434031
spread2 195.0 0.226510 0.083406 0.006274 0.174351 0.218885 0.279234 0.450493
D2 195.0 2.381826 0.382799 1.423287 2.099125 2.361532 2.636456 3.671155
PPE 195.0 0.206552 0.090119 0.044539 0.137451 0.194052 0.252980 0.527367
  • Each row of the CSV file contains an instance corresponding to one voice recording.
  • There are around six recordings per patient; the patient is identified in the first column
  • Column values vary in scale and range from negative to positive
  • All column values are real numbers, and no abnormal details have appeared so far
  • The name column is of little importance for prediction
In [302]:
df.status.value_counts()
Out[302]:
1    147
0     48
Name: status, dtype: int64
In [303]:
df.status.value_counts(1)*100
Out[303]:
1    75.384615
0    24.615385
Name: status, dtype: float64
  • 75% of the records belong to patients having Parkinson's
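Given this 75/25 imbalance, a stratified split keeps the class ratio consistent in both partitions. A minimal sketch: the labels below merely mirror the 147/48 status counts, they are not the actual dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels mirroring the 147 PD / 48 healthy counts
y = np.array([1] * 147 + [0] * 48)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-feature matrix

# stratify=y preserves the ~75/25 class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=10)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # both near 0.75
```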
In [304]:
#Rearranging - moving status column towards end of file
stat=df.pop('status') 
df['status'] = stat

Univariate Analysis

  • Analyzing Frequencies
  • MDVP:Fo(Hz) - Average vocal fundamental frequency
  • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
  • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
In [305]:
f, axes = plt.subplots(1, 3, figsize=(15,8), sharey='row')

sns.distplot(df['MDVP:Flo(Hz)'],ax=axes[0],axlabel="MDVP:Flo(Hz)-Min-" + "skew:" + str(skew(df['MDVP:Flo(Hz)'])))
sns.distplot(df['MDVP:Fo(Hz)'],ax=axes[1],axlabel="MDVP:Fo(Hz)-Avg-" + "skew:" +str(skew(df['MDVP:Fo(Hz)'])))
sns.distplot(df['MDVP:Fhi(Hz)'],ax=axes[2],axlabel="MDVP:Fhi(Hz)-Max-" + "skew:" +str(skew(df['MDVP:Fhi(Hz)'])))
plt.show()
  • Maximum vocal fundamental frequency is positively skewed; the majority of values are around 100-200, but a few outliers beyond 300 drag its tail and make it positively skewed. High outlier presence
  • Average vocal fundamental frequency is almost symmetrical, with skewness around 0.6. The highest concentration of data is around 100-150
  • Minimum vocal fundamental frequency is comparatively more positively skewed than average and less skewed than maximum. The majority of values are between 75-150
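The skewness numbers in the axis labels can be read against a simple benchmark: roughly 0 for a symmetric distribution, increasingly positive as a high-value tail grows. A quick synthetic illustration (the numbers are made up, not taken from the dataset):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
symmetric = rng.normal(150, 40, 2000)               # bell-shaped, skew near 0
with_tail = np.concatenate(
    [symmetric, rng.normal(450, 30, 100)])          # add a high-value tail

print(round(float(skew(symmetric)), 2))   # near 0
print(round(float(skew(with_tail)), 2))   # clearly positive
```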
In [ ]:
 
  • Analyzing
  • MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
In [306]:
f, axes = plt.subplots(2,3,figsize=(15,8)) 
sns.distplot(df['MDVP:Jitter(%)'],ax=axes[0,0],axlabel="MDVP:Jitter(%)-" + "skew:" + str(skew(df['MDVP:Jitter(%)'])) )
sns.distplot(df['MDVP:Jitter(Abs)'],ax=axes[0,1],axlabel="MDVP:Jitter(Abs)-" + "skew:" + str(skew(df['MDVP:Jitter(Abs)'])) )
sns.distplot(df['MDVP:RAP'],ax=axes[0,2],axlabel="MDVP:RAP-" + "skew:" + str(skew(df['MDVP:RAP'])) )
sns.distplot(df['MDVP:PPQ'],ax=axes[1,0],axlabel="MDVP:PPQ-" + "skew:" + str(skew(df['MDVP:PPQ'])) )
sns.distplot(df['Jitter:DDP'],ax=axes[1,1],axlabel="Jitter:DDP-" + "skew:" + str(skew(df['Jitter:DDP']))) 
Out[306]:
<AxesSubplot:xlabel='Jitter:DDP-skew:3.33614099997415'>
  • Positive skewness is present in all columns in the graphs
  • No major peaks in the longer tails, indicating that very few values fall there
  • High outlier presence in all
In [ ]:
 
  • Analyzing
  • MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
In [307]:
f, axes = plt.subplots(2,3,figsize=(15,8)) 
sns.distplot(df['MDVP:Shimmer'],ax=axes[0,0],axlabel="MDVP:Shimmer-" + "skew:" + str(skew(df['MDVP:Shimmer'])) )
sns.distplot(df['MDVP:Shimmer(dB)'],ax=axes[0,1],axlabel="MDVP:Shimmer(dB)-" + "skew:" + str(skew(df['MDVP:Shimmer(dB)'])) )
sns.distplot(df['Shimmer:APQ3'],ax=axes[0,2],axlabel="Shimmer:APQ3-" + "skew:" + str(skew(df['Shimmer:APQ3'])) )
sns.distplot(df['Shimmer:APQ5'],ax=axes[1,0],axlabel="Shimmer:APQ5-" + "skew:" + str(skew(df['Shimmer:APQ5'])) )
sns.distplot(df['MDVP:APQ'],ax=axes[1,1],axlabel="MDVP:APQ-" + "skew:" + str(skew(df['MDVP:APQ']))) 
sns.distplot(df['Shimmer:DDA'],ax=axes[1,2],axlabel="Shimmer:DDA-" + "skew:" + str(skew(df['Shimmer:DDA'])))
Out[307]:
<AxesSubplot:xlabel='Shimmer:DDA-skew:1.5684333201651859'>
  • Positive skewness is present in all columns in the graphs
  • No major peaks in the longer tails, indicating that very few values fall there
  • The peak (the majority of values) is close to zero
  • High outlier presence in all
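The "high outliers" reading can be quantified with the usual 1.5×IQR rule. A hedged helper on synthetic data; in the notebook you would pass a real column such as df['MDVP:Shimmer'].

```python
import numpy as np

def iqr_outlier_count(x):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((x < lo) | (x > hi)).sum())

demo = np.array([0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.30])  # one extreme value
print(iqr_outlier_count(demo))  # 1
```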
In [ ]:
 
In [ ]:
 
  • Analyzing NHR & HNR - Noises
  • NHR,HNR - Two measures of ratio of noise to tonal components in the voice
In [308]:
f, axes = plt.subplots(1, 2, figsize=(15,8), sharey='row')

sns.distplot(df['NHR'],ax=axes[0],axlabel="NHR-" + "skew:" + str(skew(df['NHR'])))
sns.distplot(df['HNR'],ax=axes[1],axlabel="HNR-" + "skew:" +str(skew(df['HNR'])))
plt.show()
In [309]:
#sharing the y-scale does not display both distributions clearly
f, axes = plt.subplots(1, 2, figsize=(15,8))

sns.distplot(df['NHR'],ax=axes[0],axlabel="NHR-" + "skew:" + str(skew(df['NHR'])))
sns.distplot(df['HNR'],ax=axes[1],axlabel="HNR-" + "skew:" +str(skew(df['HNR'])))
plt.show()
  • NHR shows high positive skewness
  • but the peaks are very low, indicating that few values fall in those ranges
  • the majority of values range between 0.00-0.05
  • More outliers are present for NHR compared to HNR
  • HNR has slight negative skewness but overall looks normally distributed
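Strong right skew like NHR's is often tamed with a log transform before modelling. A sketch on synthetic lognormal data (not the real NHR column):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Small positive values with a long right tail, NHR-like in shape only
nhr_like = rng.lognormal(mean=-4.0, sigma=1.0, size=195)

raw_skew = float(skew(nhr_like))
log_skew = float(skew(np.log(nhr_like)))  # log of a lognormal is normal
print(round(raw_skew, 2), round(log_skew, 2))  # log version is far less skewed
```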
In [ ]:
 
  • Analyzing RPDE,D2
  • Two nonlinear dynamical complexity measures
In [310]:
f, axes = plt.subplots(1, 2, figsize=(15,8))

sns.distplot(df['RPDE'],ax=axes[0],axlabel="RPDE-" + "skew:" + str(skew(df['RPDE'])))
sns.distplot(df['D2'],ax=axes[1],axlabel="D2-" + "skew:" +str(skew(df['D2'])))
plt.show()
  • Both columns are only slightly skewed and close to normally distributed
In [ ]:
 
  • Analyzing DFA - Signal fractal scaling exponent
In [311]:
sns.distplot(df['DFA'],axlabel="DFA-" + "skew:" + str(skew(df['DFA'])))
Out[311]:
<AxesSubplot:xlabel='DFA-skew:-0.03295762313006091'>
  • The distribution is barely skewed and close to normal
In [ ]:
 
  • Analyzing spread1,spread2,PPE
  • Three nonlinear measures of fundamental frequency variation
In [312]:
f, axes = plt.subplots(1, 3, figsize=(15,8))

sns.distplot(df['spread1'],ax=axes[0],axlabel="spread1-" + "skew:" + str(skew(df['spread1'])))
sns.distplot(df['spread2'],ax=axes[1],axlabel="spread2-" + "skew:" +str(skew(df['spread2'])))
sns.distplot(df['PPE'],ax=axes[2],axlabel="PPE-" + "skew:" +str(skew(df['PPE'])))
plt.show()
  • Skewness between 0-1 for all; almost normally distributed
  • Slight outlier presence in PPE and spread1
In [ ]:
 

Bivariate Analysis

In [313]:
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(), annot=True, fmt='.2f')
Out[313]:
<AxesSubplot:>
  • From the correlation details we can observe that almost all attributes range between 0.2-0.5 on the positive side, with two negative correlations at -0.38
  • MDVP:Fo & Flo have the same impact on status of -0.38
  • Similarly, all attributes related to Jitter show a similar impact on status
  • Similarly, all attributes related to Shimmer show similar behaviour with status
  • There is high correlation between the attributes related to Jitter and Shimmer; you can observe the light & dark orange square formations for these columns in the heatmap
  • HNR has a high negative correlation with all attributes related to Jitter and Shimmer
  • The highest positive/negative correlations with status are for spread1, PPE and spread2
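These observations can be pulled out programmatically by ranking correlations with the target by absolute value. A minimal sketch on a synthetic frame (the column names here are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
status = rng.integers(0, 2, 200)
demo = pd.DataFrame({
    "status": status,
    "strong_pos": 2.0 * status + rng.normal(0, 0.5, 200),  # strongly related
    "weak_neg": -0.5 * status + rng.normal(0, 1.0, 200),   # weakly related
    "noise": rng.normal(0, 1.0, 200),                      # unrelated
})

# Rank features by |correlation| with the target
ranked = demo.corr()["status"].drop("status").sort_values(key=np.abs,
                                                          ascending=False)
print(ranked.index.tolist())
```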
In [314]:
g = sns.pairplot(df.drop(['name'],axis=1), hue="status", palette="husl")
  • Considering all the fields' values against status, we can notice two distinct peaks, but they are mostly close together, signifying that the mode values for status 0 and 1 are close by
  • The majority of fields show a significant overlap between status 0 and 1 in the kde plots
  • Many attributes clearly indicate that the low and high values belong to one particular group rather than being covered by both status 0 and 1; outliers, or very high or low values, usually belong to one group rather than both, which would otherwise make them ambiguous.
In [ ]:
 
  • Analyzing Frequencies
  • MDVP:Fo(Hz) - Average vocal fundamental frequency
  • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
  • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
In [315]:
f, axes = plt.subplots(1, 3, figsize=(15,8))

sns.boxplot(x=df['status'],y=df['MDVP:Fo(Hz)'],ax=axes[0])
sns.boxplot(x=df['status'],y=df['MDVP:Flo(Hz)'],ax=axes[1])
sns.boxplot(x=df['status'],y=df['MDVP:Fhi(Hz)'],ax=axes[2])


plt.show()
In [316]:
#Viewing on same Y axis
f, axes = plt.subplots(1, 3, figsize=(15,8),sharey='row')

sns.boxplot(x=df['status'],y=df['MDVP:Fo(Hz)'],ax=axes[0])
sns.boxplot(x=df['status'],y=df['MDVP:Flo(Hz)'],ax=axes[1])
sns.boxplot(x=df['status'],y=df['MDVP:Fhi(Hz)'],ax=axes[2])


plt.show()
  • Considering both sets of boxplots:
  • There is a significant variation in median between status 0 and 1 for average and maximum vocal frequency.
  • The median for average on status 0 is close to 200 and for status 1 is around 150
  • The median for maximum on status 0 is close to 250 and for status 1 is around 150
  • A few outliers are present for maximum vocal frequencies
  • For status 0, the box and whiskers cover a much wider range compared to status 1 in all 3 vocal fundamental frequencies
  • There is a big overlap in the range of frequencies for status 0 and 1, but the higher ranges are mostly covered by status 0
  • For maximum vocal frequencies, the majority of outliers belong to status 1.
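The median gaps read off the boxplots can be confirmed numerically with a groupby; on the real frame the call would be along the lines of df.groupby('status')['MDVP:Fo(Hz)'].median(). A sketch on synthetic numbers chosen to echo the ~200 vs ~150 gap:

```python
import pandas as pd

# Synthetic stand-in for one frequency column per status group
demo = pd.DataFrame({
    "status": [0, 0, 0, 1, 1, 1],
    "fo":     [195.0, 200.0, 205.0, 145.0, 150.0, 155.0],
})

medians = demo.groupby("status")["fo"].median()
print(medians.loc[0], medians.loc[1])  # 200.0 150.0
```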
In [ ]:
 
In [ ]:
 
  • Analyzing
  • MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
In [317]:
f, axes = plt.subplots(2,3,figsize=(15,15)) 
sns.boxplot(x=df['status'],y=df['MDVP:Jitter(%)'],ax=axes[0,0])
sns.boxplot(x=df['status'],y=df['MDVP:Jitter(Abs)'],ax=axes[0,1])
sns.boxplot(x=df['status'],y=df['MDVP:RAP'],ax=axes[0,2])
sns.boxplot(x=df['status'],y=df['MDVP:PPQ'],ax=axes[1,0])
sns.boxplot(x=df['status'],y=df['Jitter:DDP'],ax=axes[1,1]) 
Out[317]:
<AxesSubplot:xlabel='status', ylabel='Jitter:DDP'>
  • High presence of outliers in the attributes related to jitter
  • For status 1, the whiskers and box cover a wider range, and nearly all outliers belong to status 1.
  • The majority of higher values and outliers belong to status 1
  • This suggests that higher jitter is a strong indicator of Parkinson's
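Whether higher jitter really separates the groups, beyond what the boxplots suggest, can be checked with a rank test, which is robust to the heavy skew seen above. A synthetic sketch, not the real jitter values:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)
jitter_pd = rng.lognormal(-5.0, 0.6, 147)       # PD-like: shifted higher
jitter_healthy = rng.lognormal(-5.6, 0.4, 48)   # healthy-like: lower

# One-sided test: is the PD-like group stochastically larger?
stat, p = mannwhitneyu(jitter_pd, jitter_healthy, alternative="greater")
print(p < 0.05)  # True
```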
In [ ]:
 
  • Analyzing
  • MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
In [318]:
f, axes = plt.subplots(2, 3, figsize=(15,15))

sns.boxplot(x=df['status'],y=df['MDVP:Shimmer'],ax=axes[0,0] )
sns.boxplot(x=df['status'],y=df['MDVP:Shimmer(dB)'],ax=axes[0,1] )
sns.boxplot(x=df['status'],y=df['Shimmer:APQ3'],ax=axes[0,2])
sns.boxplot(x=df['status'],y=df['Shimmer:APQ5'],ax=axes[1,0])
sns.boxplot(x=df['status'],y=df['MDVP:APQ'],ax=axes[1,1]) 
sns.boxplot(x=df['status'],y=df['Shimmer:DDA'],ax=axes[1,2])
plt.show()
  • Noticing similar behaviour to Jitter
  • Whiskers and boxes cover wider ranges for status 1 compared to 0
  • The majority of higher values and outliers belong to status 1
In [ ]:
 
In [ ]:
 
  • Analyzing NHR & HNR - Noises
  • NHR,HNR - Two measures of ratio of noise to tonal components in the voice
In [319]:
f, axes = plt.subplots(1, 2, figsize=(15,8))

sns.boxplot(x=df['status'],y=df['NHR'],ax=axes[0])
sns.boxplot(x=df['status'],y=df['HNR'],ax=axes[1])
plt.show()
  • Status 1 has a wider whisker and box range compared to status 0 for both NHR and HNR
  • For NHR, the majority of higher values and outliers are for status 1; beyond 0.15 all are for status 1
  • For HNR as well, the majority of outliers towards the low range of values are for status 1.
In [ ]:
 
In [ ]:
 
  • Analyzing RPDE,D2
  • Two nonlinear dynamical complexity measures
In [320]:
f, axes = plt.subplots(1, 2, figsize=(15,8))

sns.boxplot(x=df['status'],y=df['RPDE'],ax=axes[0])
sns.boxplot(x=df['status'],y=df['D2'],ax=axes[1])
plt.show()
  • For RPDE, the whiskers and boxes are of similar range.
  • For D2, the majority of outliers belong to status 1
In [ ]:
 
In [ ]:
 
  • Analyzing DFA - Signal fractal scaling exponent
In [321]:
sns.boxplot(x=df['status'],y=df['DFA'])
Out[321]:
<AxesSubplot:xlabel='status', ylabel='DFA'>
  • The whisker is much bigger for status 1 compared to 0.
  • Values above 0.8 and below 0.6 belong to status 1
In [ ]:
 
In [ ]:
 
  • Analyzing spread1,spread2,PPE
  • Three nonlinear measures of fundamental frequency variation
In [322]:
f, axes = plt.subplots(1, 3, figsize=(15,8))

sns.boxplot(x=df['status'],y=df['spread1'],ax=axes[0])
sns.boxplot(x=df['status'],y=df['spread2'],ax=axes[1])
sns.boxplot(x=df['status'],y=df['PPE'],ax=axes[2])
plt.show()
  • High-range values for spread1, spread2 and PPE fall under status 1, and low-range values under status 0.
In [ ]:
 

Null Values and Scaling

In [323]:
df.isnull().sum()
Out[323]:
name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
status              0
dtype: int64
In [324]:
df.isna().sum()
Out[324]:
name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
status              0
dtype: int64
  • No Null/nan values found
In [325]:
df=df.drop("name",axis=1)
In [326]:
scaler=MinMaxScaler((-1,1))
In [327]:
#note: "name" was already dropped above, so the [:,1:] slice also excludes the first feature column, MDVP:Fo(Hz)
X=scaler.fit_transform( df.loc[:,df.columns != 'status'].values[:,1:])
y=df.loc[:,'status'].values
X[:,0:1]
Out[327]:
array([[-0.77481654],
       [-0.81013911],
       [-0.88174367],
       [-0.85414536],
       [-0.83818243],
       [-0.88153546],
       [-0.85670515],
       [-0.9522541 ],
       [-0.87783664],
       [-0.92668483],
       [-0.95878625],
       [-0.94396236],
       [-0.76434878],
       [-0.685665  ],
       [-0.75030875],
       [-0.52923645],
       [ 0.00886535],
       [-0.46911622],
       [-0.69917838],
       [-0.6437817 ],
       [-0.7403758 ],
       [-0.71129959],
       [-0.62817396],
       [-0.6301581 ],
       [-0.59706462],
       [-0.57599437],
       [-0.5665595 ],
       [-0.56497545],
       [-0.48870449],
       [-0.60725068],
       [-0.57234453],
       [-0.56166447],
       [-0.53843045],
       [-0.55312369],
       [-0.55344213],
       [-0.5573655 ],
       [-0.62939874],
       [-0.65927105],
       [-0.59539892],
       [-0.59101014],
       [-0.60979005],
       [-0.61463609],
       [-0.40728538],
       [-0.40112884],
       [-0.39264521],
       [-0.37581677],
       [-0.34701001],
       [-0.34947181],
       [-0.89195015],
       [-0.88607939],
       [-0.86558478],
       [-0.86900599],
       [-0.85340641],
       [-0.84599651],
       [-0.86727089],
       [-0.90114823],
       [-0.88192331],
       [-0.88662237],
       [-0.87853476],
       [-0.30935219],
       [-0.44742542],
       [-0.44133011],
       [-0.47252927],
       [-0.45916286],
       [-0.3873011 ],
       [-0.43906835],
       [-0.76472437],
       [-0.73682803],
       [-0.75475877],
       [-0.75227247],
       [-0.75397083],
       [-0.69605111],
       [-0.84663748],
       [ 0.98566194],
       [-0.89403227],
       [-0.9164457 ],
       [-0.80942058],
       [-0.90508385],
       [-1.        ],
       [-0.94467273],
       [-0.97338559],
       [-0.97725997],
       [-0.96785368],
       [-0.99934679],
       [-0.57779887],
       [-0.59998775],
       [-0.59049573],
       [-0.48871266],
       [-0.55416067],
       [-0.4946365 ],
       [-0.57596987],
       [-0.75018627],
       [-0.74343366],
       [-0.75780438],
       ...,
       [ 0.20361309],
       [-0.35441175]])
In [328]:
#X = df.drop("status",axis=1)
#Y = df["status"]

Split the dataset into training and test sets in a 70:30 ratio

In [329]:
X_train, X_test, y_train,  y_test = train_test_split(X, y,train_size=0.7, random_state=10)
print(len(X_train))
print(len(X_test))
136
59
In [330]:
y_train.shape, y_test.shape
Out[330]:
((136,), (59,))
In [331]:
y_test[0:5]
Out[331]:
array([1, 1, 1, 1, 0], dtype=int64)
  • The data is scaled because the attributes span very different ranges of values
  • The dataset has been split into training and test sets
  • We have very little data to train on, and even less to test on
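Given how little data there is, k-fold cross-validation can make better use of it than a single 70:30 hold-out. A minimal sketch on synthetic stand-in data (`make_classification` is an illustrative substitute for the notebook's dataframe, not part of it):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as the 195-row Parkinson's table.
X_demo, y_demo = make_classification(n_samples=195, n_features=22,
                                     weights=[0.25, 0.75], random_state=10)

# Stratified folds keep the 0/1 class ratio in every fold, which matters
# when one class is as scarce as status == 0 is here.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(LogisticRegression(max_iter=10000), X_demo, y_demo,
                         cv=cv, scoring='recall')
print('Recall per fold:', np.round(scores, 3))
print('Mean recall    :', round(scores.mean(), 3))
```

Every row then contributes to both training and evaluation, at the cost of fitting the model once per fold.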
In [332]:
from sklearn import metrics
def draw_cm( actual, predicted ):
    cm = metrics.confusion_matrix( actual, predicted, labels=[0,1] )
    sns.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["Status 0", "Status 1"] , yticklabels = ["Status 0", "Status 1"] )
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
In [333]:
modelComp=pd.DataFrame()

Train and test Models

In [334]:
#Logistic Regression
In [335]:
from sklearn.linear_model import LogisticRegression
logRegModel=LogisticRegression()
logRegModel.fit(X_train,y_train)
y_predict=logRegModel.predict(X_test)

from sklearn.metrics import accuracy_score,confusion_matrix,recall_score,f1_score,precision_score,roc_curve,log_loss,auc
print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=pd.DataFrame({'Model':['Logistic Regression - 0.5'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]})
Accuracy score: 0.8813559322033898
Confusion matrix:
 [[10  6]
 [ 1 42]]
Recall Score:  0.9767441860465116
Precision Score:  0.875
F1 Score:  0.923076923076923
In [336]:
from sklearn.preprocessing import binarize
#changing the classification threshold from the default 0.5 to 0.4
y_pred_class = binarize([logRegModel.predict_proba(X_test)[:, 1]], threshold=0.4)[0]

print('Accuracy score:',accuracy_score(y_test,y_pred_class))
print('Confusion matrix:\n',confusion_matrix(y_test,y_pred_class))
print('Recall Score: ',recall_score(y_test, y_pred_class))
print('Precision Score: ',precision_score(y_test, y_pred_class))
print('F1 Score: ',f1_score(y_test, y_pred_class))
draw_cm(y_test, y_pred_class)
modelComp=modelComp.append(pd.DataFrame({'Model':['Logistic Regression - 0.4'],'Accuracy':[accuracy_score(y_test,y_pred_class)*100],'Precision':[precision_score(y_test, y_pred_class)*100],'Recall':[recall_score(y_test, y_pred_class)*100]}))
Accuracy score: 0.864406779661017
Confusion matrix:
 [[ 8  8]
 [ 0 43]]
Recall Score:  1.0
Precision Score:  0.8431372549019608
F1 Score:  0.9148936170212766
In [337]:
#KNN
In [338]:
from sklearn.neighbors import KNeighborsClassifier
KnnModel = KNeighborsClassifier(n_neighbors=3)
KnnModel.fit(X_train,y_train)
y_predict=KnnModel.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))

draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['KNN - 3 Neighbours'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.864406779661017
Confusion matrix:
 [[10  6]
 [ 2 41]]
Recall Score:  0.9534883720930233
Precision Score:  0.8723404255319149
F1 Score:  0.9111111111111112
In [339]:
#NaiveBayes Gaussian
In [340]:
from sklearn.naive_bayes import GaussianNB
NBGauModel = GaussianNB()

NBGauModel.fit(X_train,y_train)
y_predict=NBGauModel.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Naive Bayes - Gaussian'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.7796610169491526
Confusion matrix:
 [[16  0]
 [13 30]]
Recall Score:  0.6976744186046512
Precision Score:  1.0
F1 Score:  0.8219178082191781
In [341]:
#SVM
In [342]:
from sklearn.svm import SVC
clf = SVC(kernel='linear')

clf.fit(X_train,y_train)
y_predict=clf.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['SVC'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8983050847457628
Confusion matrix:
 [[10  6]
 [ 0 43]]
Recall Score:  1.0
Precision Score:  0.8775510204081632
F1 Score:  0.9347826086956522
In [343]:
#Comparing Accuracy, Precision and Recall
modelComp
Out[343]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
  • Among the standard classification algorithms, SVM and logistic regression give the best Accuracy, Precision and Recall values
  • Lowering the logistic regression threshold to 0.4 gives the best Recall, but at the cost of Precision and overall Accuracy
  • Taking all three metrics together, SVM has the better overall values alongside its Recall
  • A 100% Recall score looks good, but with such a small sample it may give unexpected or biased results on real-world data. We should look for a model that balances all the metrics rather than one that maxes out a single metric.
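The precision/recall trade-off noted above can be explored by sweeping the decision threshold directly, without `binarize`. A minimal sketch on synthetic stand-in data (the dataset and variable names here are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=195, n_features=22,
                                     weights=[0.25, 0.75], random_state=10)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.7,
                                      random_state=10)
proba = LogisticRegression(max_iter=10000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]

# Lowering the threshold turns more probabilities into class 1, so recall
# can only rise (or stay flat) while precision tends to fall.
for t in (0.5, 0.4, 0.3):
    pred = (proba >= t).astype(int)
    print(f'threshold={t}: precision={precision_score(yte, pred):.3f} '
          f'recall={recall_score(yte, pred):.3f}')
```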
In [344]:
#META classifier
In [345]:
#Stacking
In [346]:
from sklearn.ensemble import StackingClassifier

#Base Learners
#knn_clf = KNeighborsClassifier(n_neighbors=3)
#svc_clf = SVC(kernel='linear')
#lr_clf = LogisticRegression(max_iter=10000)
#nb_Gau_clf=GaussianNB()
estimators = [('knn_clf',KNeighborsClassifier(n_neighbors=3)),
              ('nb_clf',GaussianNB()),
             ('lr_clf',LogisticRegression()),
             ('svc_clf',SVC(kernel='linear'))]
lr = LogisticRegression(max_iter=10000) # meta classifier
sclf = StackingClassifier(estimators=estimators, final_estimator=lr)

sclf.fit(X_train,y_train)
y_predict=sclf.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Stacking Classifier(KNN,NB,LR,SVC)LR'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8983050847457628
Confusion matrix:
 [[11  5]
 [ 1 42]]
Recall Score:  0.9767441860465116
Precision Score:  0.8936170212765957
F1 Score:  0.9333333333333332
In [347]:
y_pred_proba = sclf.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve Stacking - LogReg(area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [348]:
modelComp
Out[348]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
  • With the Stacking Classifier:
  • The stacking classifier gives the best accuracy so far while keeping strong Recall and Precision values
  • SVC has 100% Recall and NB has 100% Precision, but their overall accuracy was slightly lower
  • The stacking classifier attains the highest Accuracy with next-to-best Precision and Recall; only 1 value from Status 1 is misclassified
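Part of why stacking generalises is that `StackingClassifier` trains the meta learner on out-of-fold predictions from the base learners. A small sketch on synthetic stand-in data (only two base learners, for brevity; the data is illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=195, n_features=22,
                                     random_state=10)

# cv=5 (the default) refits each base learner 5 times on 4/5 of the data;
# the meta learner only ever sees out-of-fold predictions, which keeps it
# from simply memorising overconfident base-learner outputs.
stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier(n_neighbors=3)),
                ('nb', GaussianNB())],
    final_estimator=LogisticRegression(max_iter=10000),
    cv=5)
stack.fit(X_demo, y_demo)

# transform() exposes the meta-features the final estimator is trained on:
# one probability column per base learner for a binary problem.
print('Meta-feature shape:', stack.transform(X_demo).shape)
```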
In [349]:
#Changing the final estimator to SVC in stacking
In [350]:
estimators = [('knn_clf',KNeighborsClassifier(n_neighbors=3)),
              ('nb_clf',GaussianNB()),
             ('lr_clf',LogisticRegression()),
             ('svc_clf',SVC(kernel='linear'))]
sclf = StackingClassifier(estimators=estimators, final_estimator=SVC(kernel='linear'))

sclf.fit(X_train,y_train)
y_predict=sclf.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Stacking Classifier(KNN,NB,LR,SVC)SVC'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8813559322033898
Confusion matrix:
 [[10  6]
 [ 1 42]]
Recall Score:  0.9767441860465116
Precision Score:  0.875
F1 Score:  0.923076923076923
In [351]:
modelComp
Out[351]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)SVC 88.135593 87.500000 97.674419
In [352]:
#Changing the final estimator to GaussianNB
In [353]:
estimators = [('knn_clf',KNeighborsClassifier(n_neighbors=3)),
              ('nb_clf',GaussianNB()),
             ('lr_clf',LogisticRegression()),
             ('svc_clf',SVC(kernel='linear'))]
sclf = StackingClassifier(estimators=estimators, final_estimator=GaussianNB())

sclf.fit(X_train,y_train)
y_predict=sclf.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Stacking Classifier(KNN,NB,LR,SVC)NB'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8983050847457628
Confusion matrix:
 [[13  3]
 [ 3 40]]
Recall Score:  0.9302325581395349
Precision Score:  0.9302325581395349
F1 Score:  0.9302325581395349
In [354]:
modelComp
Out[354]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)SVC 88.135593 87.500000 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)NB 89.830508 93.023256 93.023256
  • Across the stacking classifiers with different final estimators, the algorithm consistently delivers strong results
  • The stacking classifier with logistic regression as the final estimator gives the best results so far
In [355]:
#Decision Tree
In [356]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(criterion='entropy',max_depth=5,random_state=10,min_samples_leaf=5)
dt_model.fit(X_train, y_train)
y_predict=dt_model.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Decision Tree'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8983050847457628
Confusion matrix:
 [[10  6]
 [ 0 43]]
Recall Score:  1.0
Precision Score:  0.8775510204081632
F1 Score:  0.9347826086956522
In [357]:
y_pred_proba = dt_model.predict_proba(X_test)[:, 1]
[fpr0, tpr0, thr0] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, label='ROC curve Stacking - LogReg(area = %0.3f)' % auc(fpr, tpr))
plt.plot(fpr0, tpr0, label='ROC curve DecisionTree(area = %0.3f)' % auc(fpr0, tpr0))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [358]:
modelComp
Out[358]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)SVC 88.135593 87.500000 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)NB 89.830508 93.023256 93.023256
0 Decision Tree 89.830508 87.755102 100.000000
  • Even with very minimal tuning, the Decision Tree on its own already gives a very good solution
In [359]:
#Bagging Classifier
In [360]:
from sklearn.ensemble import BaggingClassifier
bgclf = BaggingClassifier(base_estimator=dt_model, n_estimators=50, max_samples=.7)
bgclf = bgclf.fit(X_train, y_train)

y_predict=bgclf.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['BaggingClassifier'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.9152542372881356
Confusion matrix:
 [[12  4]
 [ 1 42]]
Recall Score:  0.9767441860465116
Precision Score:  0.9130434782608695
F1 Score:  0.9438202247191011
In [361]:
y_pred_proba = bgclf.predict_proba(X_test)[:, 1]
[fpr1, tpr1, thr1] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, label='ROC curve Stacking - LogReg(area = %0.3f)' % auc(fpr, tpr))
plt.plot(fpr0, tpr0, label='ROC curve DecisionTree(area = %0.3f)' % auc(fpr0, tpr0))
plt.plot(fpr1, tpr1, label='ROC curve Bagging - DT(area = %0.3f)' % auc(fpr1, tpr1))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [362]:
modelComp
Out[362]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)SVC 88.135593 87.500000 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)NB 89.830508 93.023256 93.023256
0 Decision Tree 89.830508 87.755102 100.000000
0 BaggingClassifier 91.525424 91.304348 97.674419
  • The Bagging Classifier gives the best Accuracy so far without sacrificing much Precision or Recall
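Bagging's bootstrap sampling also yields a free validation estimate: the rows left out of each tree's bootstrap sample can score that tree out-of-bag. A sketch on synthetic stand-in data (illustrative, not the notebook's dataframe):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=195, n_features=22,
                                     random_state=10)

# Each of the 50 trees is fit on a bootstrap sample; on average ~37% of
# rows are left out of any one sample, so they can evaluate that tree
# without touching a held-out test set.
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                        n_estimators=50, oob_score=True, random_state=10)
bag.fit(X_demo, y_demo)
print('Out-of-bag accuracy estimate:', round(bag.oob_score_, 3))
```

With only 195 rows, the OOB score is a useful second opinion alongside the 59-row test set.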
In [363]:
#Applying Ensemble Model
In [364]:
#Random Forest Classifier
In [365]:
from sklearn.ensemble import RandomForestClassifier
rfclf = RandomForestClassifier(n_estimators = 50)
rfclf.fit(X_train, y_train)

y_predict=rfclf.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Random Forest'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.9322033898305084
Confusion matrix:
 [[13  3]
 [ 1 42]]
Recall Score:  0.9767441860465116
Precision Score:  0.9333333333333333
F1 Score:  0.9545454545454545
In [366]:
y_pred_proba = rfclf.predict_proba(X_test)[:, 1]
[fpr2, tpr2, thr2] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, label='ROC curve Stacking - LogReg(area = %0.3f)' % auc(fpr, tpr))
plt.plot(fpr0, tpr0, label='ROC curve DecisionTree(area = %0.3f)' % auc(fpr0, tpr0))
plt.plot(fpr1, tpr1, label='ROC curve Bagging - DT(area = %0.3f)' % auc(fpr1, tpr1))
plt.plot(fpr2, tpr2, label='ROC curve RandomForest(area = %0.3f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [367]:
modelComp
Out[367]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)SVC 88.135593 87.500000 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)NB 89.830508 93.023256 93.023256
0 Decision Tree 89.830508 87.755102 100.000000
0 BaggingClassifier 91.525424 91.304348 97.674419
0 Random Forest 93.220339 93.333333 97.674419
  • Random Forest and the Bagging Classifier give the best solutions so far, with 93.2% Accuracy, 93% Precision and 97.67% Recall
  • In total 4 values are misclassified, but only 1 misclassification is from Status 1
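Beyond its accuracy, a fitted random forest exposes impurity-based feature importances, which can hint at which voice measures the model actually leans on. A sketch on synthetic stand-in data (the feature indices here are illustrative, not the notebook's column names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=195, n_features=22,
                                     n_informative=5, random_state=10)

rf = RandomForestClassifier(n_estimators=50, random_state=10)
rf.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; ranking them surfaces the features
# the forest splits on most productively.
order = np.argsort(rf.feature_importances_)[::-1]
for idx in order[:5]:
    print(f'feature {idx}: {rf.feature_importances_[idx]:.3f}')
```

On the real dataframe, pairing `rf.feature_importances_` with the column names would show which jitter/shimmer measures dominate.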
In [368]:
# ADABOOST
In [369]:
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier( n_estimators= 20) 
abcl.fit(X_train, y_train)
y_predict=abcl.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Ada Boost'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.9322033898305084
Confusion matrix:
 [[14  2]
 [ 2 41]]
Recall Score:  0.9534883720930233
Precision Score:  0.9534883720930233
F1 Score:  0.9534883720930233
In [370]:
y_pred_proba = abcl.predict_proba(X_test)[:, 1]
[fpr3_0, tpr3_0, thr3_0] = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(8,8))
plt.plot(fpr, tpr, label='ROC curve Stacking - LogReg(area = %0.3f)' % auc(fpr, tpr))
plt.plot(fpr0, tpr0, label='ROC curve DecisionTree(area = %0.3f)' % auc(fpr0, tpr0))
plt.plot(fpr1, tpr1, label='ROC curve Bagging - DT(area = %0.3f)' % auc(fpr1, tpr1))
plt.plot(fpr2, tpr2, label='ROC curve RandomForest(area = %0.3f)' % auc(fpr2, tpr2))
plt.plot(fpr3_0, tpr3_0, label='ROC curve AdaBoost(area = %0.3f)' % auc(fpr3_0, tpr3_0))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [371]:
modelComp
Out[371]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)SVC 88.135593 87.500000 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)NB 89.830508 93.023256 93.023256
0 Decision Tree 89.830508 87.755102 100.000000
0 BaggingClassifier 91.525424 91.304348 97.674419
0 Random Forest 93.220339 93.333333 97.674419
0 Ada Boost 93.220339 95.348837 95.348837
In [372]:
#Gradient Boost
In [373]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 1000, learning_rate = 0.001)
gbcl = gbcl.fit(X_train, y_train)
y_predict=gbcl.predict(X_test)



print('Accuracy score:',accuracy_score(y_test,y_predict))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precision Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['GradientBoostingClassifier'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precision':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8983050847457628
Confusion matrix:
 [[11  5]
 [ 1 42]]
Recall Score:  0.9767441860465116
Precision Score:  0.8936170212765957
F1 Score:  0.9333333333333332
In [374]:
y_pred_proba = gbcl.predict_proba(X_test)[:, 1]
[fpr3, tpr3, thr3] = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(8,4))
plt.plot(fpr, tpr, label='ROC curve Stacking - LogReg(area = %0.3f)' % auc(fpr, tpr))
plt.plot(fpr0, tpr0, label='ROC curve DecisionTree(area = %0.3f)' % auc(fpr0, tpr0))
plt.plot(fpr1, tpr1, label='ROC curve Bagging - DT(area = %0.3f)' % auc(fpr1, tpr1))
plt.plot(fpr2, tpr2, label='ROC curve RandomForest(area = %0.3f)' % auc(fpr2, tpr2))
plt.plot(fpr3_0, tpr3_0, label='ROC curve AdaBoost(area = %0.3f)' % auc(fpr3_0, tpr3_0))
plt.plot(fpr3, tpr3, label='ROC curve GradientBoost(area = %0.3f)' % auc(fpr3, tpr3))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [375]:
modelComp
Out[375]:
Model Accuracy Precision Recall
0 Logistic Regression - 0.5 88.135593 87.500000 97.674419
0 Logistic Regression - 0.4 86.440678 84.313725 100.000000
0 KNN - 3 Neighbours 86.440678 87.234043 95.348837
0 Naive Bayes - Gaussian 77.966102 100.000000 69.767442
0 SVC 89.830508 87.755102 100.000000
0 Stacking Classifier(KNN,NB,LR,SVC)LR 89.830508 89.361702 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)SVC 88.135593 87.500000 97.674419
0 Stacking Classifier(KNN,NB,LR,SVC)NB 89.830508 93.023256 93.023256
0 Decision Tree 89.830508 87.755102 100.000000
0 BaggingClassifier 91.525424 91.304348 97.674419
0 Random Forest 93.220339 93.333333 97.674419
0 Ada Boost 93.220339 95.348837 95.348837
0 GradientBoostingClassifier 89.830508 89.361702 97.674419
  • Considering all the models, the ensemble models clearly provide better overall results
  • Among the classifiers, Random Forest, AdaBoost and the Bagging Classifier give the best Accuracy, Precision and Recall in a single package; most of them are based on decision trees
  • The ROC curves for AdaBoost, Random Forest and Stacking with logistic regression as the final estimator have the best AUC, and their shapes follow a slightly different pattern: classifiers whose curves hug the top-left corner perform better
  • The ROC curve for the Decision Tree looks very different from the other models. A decision tree makes binary splits, which is computationally efficient but yields only 2^n groupings; unless the number of splits is very large, you get only 16/32/64/128 groups, whereas an algorithm such as logistic regression on continuous variables produces predictions anywhere in the continuous range between 0 and 1. Our tree can therefore only emit discrete scores rather than a continuous one. This can often be remedied by adding more samples, more continuous features, more features in general, or a model specification with a continuous prediction output.
  • Comparing the top models on Accuracy, Recall and ROC AUC, AdaBoost, Random Forest and Stacking - LogReg have close values, but Random Forest has the better overall Accuracy along with its AUC, so we take it as the best model for this task.
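The point about discrete tree scores can be checked directly by counting the distinct probabilities each model emits: a depth-5 tree has at most 2^5 = 32 leaves, hence at most 32 distinct scores, while logistic regression produces a near-continuous score per row. A sketch on synthetic stand-in data (illustrative, not the notebook's dataframe):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=195, n_features=22,
                                     random_state=10)

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5,
                              random_state=10).fit(X_demo, y_demo)
logreg = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# One score per leaf for the tree -> a stepped ROC curve with few corners;
# the logistic scores vary row by row -> a smooth-looking curve.
tree_scores = np.unique(tree.predict_proba(X_demo)[:, 1])
lr_scores = np.unique(logreg.predict_proba(X_demo)[:, 1])
print('Distinct tree scores   :', len(tree_scores))
print('Distinct logreg scores :', len(lr_scores))
```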